Exploratory Data Analysis Using Plotly Express

Background

One morning you, a data analyst, receive a new dataset, and by the afternoon you need to present some insights from it to your supervisor. Unfortunately, this sometimes happens. The best way to present insights is, of course, with visualisation, and luckily there is a visualisation library called plotly-express that can help. Plotly Express is a terse, consistent, high-level API for rapid data exploration and figure generation. It is meant to help you visualise your data quickly and easily. I created this course book to demonstrate how to use the library to explore your data and quickly answer some of your business questions.

The data

The dataset we use consists of the marks secured by students in various subjects. It is available on Kaggle as Student Performance in Exams.

The inspiration is to understand the influence of parental background, test preparation, and so on, on student performance. The dataset comprises 1,000 rows and 8 columns:

  • gender
  • race / ethnicity
  • parental level of education - for example bachelor's degree, master's degree, or some college
  • lunch - standard or free/reduced
  • test preparation course - none or completed
  • math score
  • reading score
  • writing score

Libraries

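The libraries we use today are pandas and plotly_express. You can install the latter with pip install plotly-express. (In recent Plotly releases, Plotly Express is also bundled with the plotly package as plotly.express.)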

In [114]:
import pandas as pd
import plotly_express as px

Reading the data

As usual, we start by reading the data. Here I rename the columns to names that are easier to work with and then look at the first few rows.

In [56]:
df = pd.read_csv('data_input/StudentsPerformance.csv')
df.columns = ['gender', 'ethnicity', 'parental_level_of_education','lunch','test_preparation_course','math','reading','writing']
df.head()
Out[56]:
   gender  ethnicity  parental_level_of_education         lunch  test_preparation_course  math  reading  writing
0  female    group B            bachelor's degree      standard                     none    72       72       74
1  female    group C                 some college      standard                completed    69       90       88
2  female    group B              master's degree      standard                     none    90       95       93
3    male    group A           associate's degree  free/reduced                     none    47       57       44
4    male    group C                 some college      standard                     none    76       78       75
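Assigning a whole list to df.columns relies on the columns arriving in exactly the expected order. A slightly more defensive sketch renames by name instead. The original header names below are an assumption based on the column list above (the real CSV may spell them differently), and the two data rows are copied from the output above.

```python
import io
import pandas as pd

# Tiny inline stand-in for StudentsPerformance.csv.
# The header spelling is an assumption; check it against the real file.
raw = io.StringIO(
    "gender,race/ethnicity,parental level of education,lunch,"
    "test preparation course,math score,reading score,writing score\n"
    "female,group B,bachelor's degree,standard,none,72,72,74\n"
    "male,group A,associate's degree,free/reduced,none,47,57,44\n"
)
df = pd.read_csv(raw)

# Map each original name to a short, attribute-friendly name.
# Columns not listed here (gender, lunch) keep their names.
new_names = {
    "race/ethnicity": "ethnicity",
    "parental level of education": "parental_level_of_education",
    "test preparation course": "test_preparation_course",
    "math score": "math",
    "reading score": "reading",
    "writing score": "writing",
}
df = df.rename(columns=new_names)
print(df.columns.tolist())
```

With rename, a column that is missing from the file is simply left alone instead of silently shifting every other name by one position.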

Next, we check the data type of each column.

In [57]:
df.dtypes
Out[57]:
gender                         object
ethnicity                      object
parental_level_of_education    object
lunch                          object
test_preparation_course        object
math                            int64
reading                         int64
writing                         int64
dtype: object

First Exploration

As usual, let's first look at the distribution of our categorical data.

In [65]:
print(df.gender.value_counts(),"\n\n",
      df.lunch.value_counts(),"\n\n",
      df.ethnicity.value_counts(),"\n\n",
      df.parental_level_of_education.value_counts(),"\n\n",
      df.test_preparation_course.value_counts(),
     sep='')
female    518
male      482
Name: gender, dtype: int64

standard        645
free/reduced    355
Name: lunch, dtype: int64

group C    319
group D    262
group B    190
group E    140
group A     89
Name: ethnicity, dtype: int64

some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: parental_level_of_education, dtype: int64

none         642
completed    358
Name: test_preparation_course, dtype: int64
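Raw counts are often easier to compare as shares. As a minimal sketch, using a small made-up frame rather than the real dataset, we can loop over the categorical columns and ask value_counts for proportions with normalize=True:

```python
import pandas as pd

# Made-up sample for illustration; in the notebook you would use the df loaded earlier.
df = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "female"],
    "lunch": ["standard", "standard", "free/reduced", "standard", "free/reduced"],
})

categorical = ["gender", "lunch"]

# For each column, report every category as a share of all rows.
shares = {col: df[col].value_counts(normalize=True).round(2) for col in categorical}

for col, share in shares.items():
    print(share, end="\n\n")
```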

From the distribution of our categorical columns, here are some insights we can take away:

  • The genders are distributed quite evenly (518 female, 482 male).
  • Most students (645 of 1,000) receive the standard lunch rather than the free/reduced one.
  • Most parents do not have a high level of education: only 177 of 1,000 hold a bachelor's or master's degree.
  • Most students (642 of 1,000) did not take the test preparation course.

Next, let's see the distribution of our numeric columns.

In [115]:
df.describe()
Out[115]:
             math      reading      writing
count  1000.00000  1000.000000  1000.000000
mean     66.08900    69.169000    68.054000
std      15.16308    14.600192    15.195657
min       0.00000    17.000000    10.000000
25%      57.00000    59.000000    57.750000
50%      66.00000    70.000000    69.000000
75%      77.00000    79.000000    79.000000
max     100.00000   100.000000   100.000000

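From the distribution of our numeric columns, here are some insights we can take away:

  • Math has a lower average score (about 66) than reading (about 69) and writing (about 68).
  • The middle half of the students score between roughly 57 and 79 in every subject, with medians between 66 and 70.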

Question

After this first exploration, there are a couple of questions we can answer with this data. For the demo, let's answer these two:

  • Does a certain gender excel in a certain subject?
  • Is there a specific ethnicity that is better at math?

Answering the question

Before we start answering the questions, let's first look at the distribution of our subjects (gender and ethnicity). The first plot we use from the library is the bar plot, one of the most effective plots for answering a wide range of questions. In Plotly Express we can use bar(dataframe, x, y). The only parameters you need here are the dataframe and x.

In [116]:
fig = px.bar(df, x= 'gender')
fig.show(renderer="notebook")

That's how you make a bar plot. Our first objective is to see how ethnicity is distributed within each gender. We can map ethnicity to color to differentiate the groups.

In [117]:
fig = px.bar(df,
             x= 'gender',
             color='ethnicity')
fig.show(renderer="notebook")

We can't get a clear picture from that plot because the bars are stacked. We can change this with the barmode parameter. Its default is 'relative'; to unstack the bars we can use 'group'.

In [118]:
fig = px.bar(df,
             x= 'gender',
             color='ethnicity', 
             barmode='group')
fig.show(renderer="notebook")

Now we can see the distribution, but let's beautify the plot a bit. First, you can give the plot a theme with the template parameter. You can use one of the built-in templates, for example plotly_dark or plotly_white; check the documentation for more themes. Here I'll use my favorite, plotly_white.

Then, if you noticed, the groups are not in order. We can reorder them with the category_orders parameter, which accepts a dictionary mapping a column name to a list of its values in the desired order. Plotly Express orders the categories accordingly, provided the values you list match the categories in the data.

Lastly, you can always give your plot a title.

In [119]:
fig = px.bar(df,
             x= 'gender',
             color='ethnicity',
             template='plotly_white', 
             barmode='group',
             category_orders={'ethnicity':["group A","group B","group C","group D","group E"]},
             title= "Ethnicity Distribution on Gender")
fig.show(renderer="notebook")

You could actually write each of those plots on one line, but I added line breaks to make the code easier to read. As we can see from the plot, the distribution is quite similar between female and male students, although group C dominates our dataset. With that distribution, it seems reasonably safe to compare the genders in our dataset on equal terms.

Question 1

So, let's answer our first question: does a certain gender excel in a certain subject?

To answer this question we will take the math and reading subjects. Why? Because I like both of them. Just kidding: I chose math and reading because they are the subjects with the lowest and the highest average. To answer the question we can use a scatter plot; conveniently, Plotly Express offers more plot types than the bar plot. As you can guess, to make a scatter plot we use the scatter function. We put math and reading on x and y, then color the points by gender so we can see whether there is a difference, and as usual I'll use the plotly_white template.

In [121]:
fig = px.scatter(df,
                 x='math',
                 y='reading', 
                 color ='gender',
                 template='plotly_white',
                 title="Does a certain gender excel in a certain subject?")
fig.show(renderer="notebook")

We can already see the answer in the plot, but before we state it, let's add marginal plots to see the distribution of the scores. How do we see the distribution? We can add a histogram on each axis with the marginal_x and marginal_y parameters.

In [122]:
fig = px.scatter(df,
                 x='math',
                 y='reading', 
                 color ='gender',
                 marginal_x='histogram',
                 marginal_y='histogram',
                 template='plotly_white',
                 title="Does a certain gender excel in a certain subject?")
fig.show(renderer="notebook")

Don't be fooled by the colors: male is usually colored blue, but this time it is switched. You can change it if you want, but for simplicity's sake male stays red and female blue. As you can see from the scatter plot, male students do better in math, while female students excel in reading. The marginal plots tell the same story: most female students score only around average in math, while most male students score below average in reading.

So the answer to our question is yes: in this dataset, a certain gender excels in a certain subject.
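A visual impression like this can be cross-checked with a quick aggregation. The sketch below only shows the mechanics on the five rows from the head() output; the numbers it prints say nothing about the full 1,000-row dataset.

```python
import pandas as pd

# Five rows copied from the head() output above, for illustration only.
df = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "math": [72, 69, 90, 47, 76],
    "reading": [72, 90, 95, 57, 78],
})

# Average score per gender for the two subjects used in the scatter plot.
means = df.groupby("gender")[["math", "reading"]].mean()
print(means)
```

Running the same groupby on the full df gives one row per gender, which you can compare directly with what the scatter plot suggests.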

Question 2

Is there a specific ethnicity that is better at math?

Well, you know there is always a myth that a certain ethnicity is better at math. Let's see if that assumption holds. We can use another type of plot, the box plot, and as you can guess the function is box. First, let's double-check our answer to the first question with a box plot of the math subject.

In [153]:
fig = px.box(df,
             x='gender', 
             y='math',
             template='plotly_white')
fig.show(renderer="notebook")

As you can see, male students have a higher median than female students. That's the easiest way to see whether a certain category performs better with a box plot: just compare the medians. Next, let's try to answer our question. In this box plot I added one more parameter, notched, to make it easier to see where the median is.

In [158]:
fig = px.box(df,
             x='ethnicity', 
             y='math',
             template='plotly_white', 
             category_orders={'ethnicity':["group A","group B","group C","group D","group E"]},
             title="Is there a specific ethnicity that is better at math?",
             notched=True)
fig.show(renderer="notebook")

There are quite a lot of insights we can take from a box plot, such as how many outliers the data has or how spread out it is. The points outside the whiskers are outliers, and the length of the box shows the spread: the longer the box, the larger the interquartile range. But we don't need that to answer our question; we just need to see where the median is.

As we can see, one ethnicity group has a much higher median, so it's safe to say that yes, in this dataset a specific ethnicity is better at math.
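Since the box plot is read through its medians, the same comparison can be made as a table. This is a minimal sketch on the five rows from the head() output, so the ranking it prints is not the ranking of the full dataset.

```python
import pandas as pd

# Five rows copied from the head() output above, for illustration only.
df = pd.DataFrame({
    "ethnicity": ["group B", "group C", "group B", "group A", "group C"],
    "math": [72, 69, 90, 47, 76],
})

# Median math score per group, highest first.
medians = df.groupby("ethnicity")["math"].median().sort_values(ascending=False)
print(medians)
```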

Even though we have already answered our initial question, let's look at the data further by also grouping it by gender. We can do that by adding the facet_col parameter. Yes, we can use that parameter in other plot types too: one of the advantages of Plotly Express is that its plot functions share mostly the same parameters, so it's consistent.

In [161]:
fig = px.box(df,
             x='ethnicity', 
             y='math',
             color = 'gender',
             template='plotly_white', 
             notched=True,
             category_orders={'ethnicity':["group A","group B","group C","group D","group E"]},
             facet_col = 'gender',
             title="Is there a specific ethnicity and gender that is better at math?")
fig.show(renderer="notebook")

After further investigation, the ones who truly excel in math are the male students of group E. The other groups have similar medians for both female and male students: group E is clearly higher than the rest, while female students in group A clearly have a much lower median. But if you look at the maximum score (100), apart from group E only male students in groups A and D reach it, so it's hard to say that group A is the worst at this subject.

Conclusion

  • Does a certain gender excel in a certain subject? Yes
  • Is there a specific ethnicity that is better at math? Yes

The answer is yes for both questions. There are more plots you can create with Plotly Express, but these three are the most useful for answering questions; most questions can be answered with just these three. Another plot that is often useful is the heatmap. You can always read more in the documentation, which I list in the references.

In the end, I hope this article helps and makes you want to use this library. But the question is: will you present this notebook to your supervisor? Maybe yes, when time is very limited, but in the long run you can't just show a notebook. You will need a dashboard to support your presentation, so I suggest you also learn about Dash. That library is made for building dashboards purely in Python. It is based on Flask, so it's really lightweight, and yes, it also fully supports plotly-express. Feel free to reach me at mentor@algorit.ma or handoyo@algorit.ma if you have more questions or are interested in knowing more about this subject.

And of course these are not the only insights you can draw from this dataset. You can try to draw more insights by answering the questions I give in the quiz. Have fun exploring this data, and thank you for reading!

Reference